Abstract

High throughput and high content screening involve determination of the effect of many compounds on a given 
target. As currently practiced, screening for each new target typically makes little use of information from screens of 
prior targets. Further, choices of compounds to advance to drug development are made without significant screening 
against off-target effects. The overall drug development process could be made more effective, as well as less 
expensive and time consuming, if potential effects of all compounds on all possible targets could be considered, yet 
the cost of such full experimentation would be prohibitive. In this paper, we describe a potential solution: probabilistic 
models that can be used to predict results for unmeasured combinations, and active learning algorithms for efficiently 
selecting which experiments to perform in order to build those models and determining when to stop. Using simulated 
and experimental data, we show that our approaches can produce powerful predictive models without exhaustive 
experimentation and can learn them much faster than by selecting experiments at random. 
Introduction

It is increasingly accepted that the study of biology requires a 
paradigm shift from a reductionist framework to a complex 
systems approach [1-3]. Reductionist frameworks implicitly 
assume that the object of study is comprised of a finite set of 
subsystems, each functionally and essentially physically 
distinct. In this case there is a reasonable upper bound for the 
total number of experiments necessary to characterize the 
whole, one experiment per component per subsystem. For 
complex systems the upper bound on the total number of 
experiments is the number of ways in which the components 
can be taken in combinations up to some maximum number 
per experiment (ten thousand components even taken only five
at a time would require over 10^17 experiments).
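
As a quick arithmetic check of the count quoted above, the following sketch evaluates the number of ways to choose components from ten thousand. The variable names are illustrative only; the calculation assumes that an "experiment" is an unordered combination of distinct components.

```python
from math import comb

n_components = 10_000      # components in the system
max_per_experiment = 5     # largest combination size considered

# Combinations taken exactly five at a time.
exactly_five = comb(n_components, max_per_experiment)

# Combinations taken one, two, ..., up to five at a time.
up_to_five = sum(comb(n_components, k) for k in range(1, max_per_experiment + 1))

print(f"exactly five: {exactly_five:.3e}")
print(f"up to five:   {up_to_five:.3e}")
```

Both totals are dominated by the five-at-a-time term, which is why the bound grows so quickly with the maximum combination size.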

This problem is manifest when trying to determine the effects 
of potential drugs on complex systems, since drugs with 
desired effects often have undesired side effects. It has been 
argued that these constitute the greatest component of risk in 
drug development since unforeseen deleterious behaviors are 
costly to correct [4,5]. The only way to be sure that a drug does
not have side effects is to measure its effect in assays for all
potential targets. Since explicit characterization in this manner 
is infeasible, approaches that do not require exhaustive 
experimentation need to be considered [6]. To do this, we must 
assume some structure or correlations exist within the 
complete data, and that predictive models can be used to 
capture them and guide future experimentation. Algorithms for 
this type of problem are termed Active Learning in the machine 
learning literature [7-10]. There have been limited applications 
of these methods to biological problems [11-15], but none in 
the context of multi-target, multi-drug analysis. Furthermore, 
the methods we present here are equally applicable to more 
general conditions than just drugs. In this paper, we show in 
extensive computational experiments that a combination of a 
structure learning method and active learning can achieve high 
accuracy of prediction of condition-specific effects on targets 
with significantly fewer experiments than a random learner, in 
many cases with perfect accuracy without exhaustive 
experimentation. The experiments were done with both 
synthetic and experimental data. Further, we provide a method
for learning when to stop experimentation, a critical step for
practical use of active learning.

Figure 1. Active Learning Process. (A) An experiment is a combination of a target and a condition; observed experiments (filled
circles) associate a target and condition with a vector encoding an experiment result. (B) Phenotypes (filled colored circles) are
identified by cluster analysis of the experiment results. (C) From the arrangement of phenotypes across targets and conditions, a
small set of correlations φ (distributions of phenotypes across targets) are identified which are then used to impute unobserved
experiments. (D) A batch of experiments (filled grey circles) is selected based in part on predictions (outlined colored circles) from
the identified correlations. The process (B-D) is repeated until a desired goal is met.

doi: 10.1371/journal.pone.0083996.g001

Materials and Methods 
Definitions 

We consider a general problem consisting of finite sets of 
targets and conditions, combinations of which define an 
experiment, whose outcome is an experimental result (Figure
1). This is expressed as a categorical phenotype, and we are
interested in knowing the phenotype for all possible 
experiments. 

The inputs to the learning procedures considered here are a 
set of targets T, discrete conditions C and a procedure F which 
is used to form phenotypes from a space of observations O; T 
and C are fixed and finite. Observations arise by performing 
experiments taken from T×C (the experiment space).
Observations are interpreted by F to produce categorical 
phenotypes F(O). Collectively, these define the experiment 



PLOS ONE | www.plosone.org 



2 



December 2013 | Volume 8 | Issue 12 | e83996 



Active Learning Discovery of Biological Responses 



result space Ω = T×C×F(O); for convenience we also define a
function E which returns the experiment of an experiment
result: E(ω) = (t,c) when ω = (t,c,o).

The learners considered here do not initially assume that 
targets may be directly compared among themselves, nor that 
conditions may be directly compared among themselves. This 
allows us to consider potentially complicated experiment 
spaces. For instance, conditions may consist of addition of 
drugs, knockdown of gene expression, or changes in 
temperature - it is not clear how to directly compare (or 
express similarity between) temperature changes to drugs or 
drugs to gene knockdowns. Likewise, the targets may also be 
heterogeneous: some of the targets may be proteins, some 
may be RNAs and again it is not clear how to directly compare 
these. The phenotypes F(O) are therefore the sole basis of
comparison: two experiments (t_1, c_1) and (t_2, c_2) are considered
similar if they have the same phenotype. Various ways of
extending this concept produce a way of measuring similarity
of two targets across different conditions or vice versa.

The learning process constructs a sequence of predictive
models over E(Ω) by iteratively performing batches of
experiments; each step in the sequence is called a round of
experimentation. We consider the case where experiments are
acquired in batches of fixed size S; this models the case where
it is cost-effective to perform several experiments at a time
such as for high-throughput technologies. Each batch of
experiments is disjoint from the experiments already observed. The
sequence of models progressively identifies nested subsets of
Ω (and E(Ω)); after n rounds of experiments the collected data
are I_n ⊆ Ω.

At each round the structure learning problem is to identify a
predictive model M_n (M_n[I_n]). This may be used to propose a
next batch of experiments B_{n+1} ⊆ E(Ω)\E(I_n). Active learning
strategies choose experiments based on observed data: B_{n+1} | I_n
∼ f(I_n) for some function f, whereas a random learner ignores
the dependence and uniformly samples S experiments from the
remainder: B_{n+1} | I_n ∼ Uniform[E(Ω)\E(I_n)].
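
The random learner used as the baseline can be sketched in a few lines. This is a minimal illustration, assuming experiments are represented as (target, condition) tuples; the function name random_batch and its parameters are hypothetical and not taken from the authors' software.

```python
import random

def random_batch(experiments, observed, batch_size, rng):
    """Sample a batch uniformly, without replacement, from unobserved experiments.

    experiments: iterable of (target, condition) pairs, i.e. the experiment space
    observed:    iterable of (target, condition) pairs already performed
    batch_size:  the fixed batch size S
    rng:         a random.Random instance (passed in for reproducibility)
    """
    # Sorting gives a deterministic population order for a seeded generator.
    remaining = sorted(set(experiments) - set(observed))
    return rng.sample(remaining, min(batch_size, len(remaining)))

# Usage: a 3 x 4 experiment space with two experiments already observed.
space = [(t, c) for t in range(3) for c in range(4)]
done = {(0, 0), (1, 1)}
batch = random_batch(space, done, batch_size=5, rng=random.Random(0))
```

An active learner has the same interface but replaces the uniform draw with a choice that depends on the model learned from the observed data.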

Structure Learning 

We introduce a model class which assumes that
observations O are distributed in condition-specific manners.
That is, we will estimate a set of distributions Φ, the size of
which is re-estimated each round. Each distribution φ is a
function from a subset of the targets T (called its "support") to
the set of phenotypes F(O); for targets not in the support of a
distribution, no phenotype is associated. For each condition c,
there is at least one distribution that can make predictions for
some of the targets. Informally, since several conditions can be
associated with the same distributions, these correlations
describe mutual predictions from one target-phenotype pair to
another across conditions. From these we can build an
asymmetric model of the distribution P[F(O) | (t,c)].

The conditional independence structure is encoded by a
valuation Γ which indicates which distributions each experiment
(t,c) ∈ E(Ω) depends on. For convenience, we assume an
indexing of the distributions. Formally, a valuation Γ : T×C →
2^|Φ| maps an experiment to a set of indices over the
distributions. Independence of two experiments e_1, e_2 ∈ E(Ω) is
expressed as disjoint valuations, P[e_1] ⊥ P[e_2] ⇒ Γ(e_1) ∩ Γ(e_2)
= ∅; informally this means that these two experiments were
estimated to have their phenotypes by unrelated causes. A
choice operator ε resolves cases where an unobserved
(predicted) experiment has multiple valuations (|Γ(e)| > 1) to
form coherent predictions; different ε lead to different
generalizations.

Choices for these form a model M = (Φ, Γ, ε). Predictions for
an observed experiment ω = (t,c,o) in I are produced through Γ:

\[ P[F(O) \mid (t,c), M] = \phi_{\Gamma(t,c)}[t] = F(o) \]

In words, the predicted phenotype of an observed
experiment is such that the valuation of the experiment is a
distribution that maps the target to the observed phenotype.
Estimates for observed data do not depend on ε. Predictions
for unobserved (t,c) ∈ E(Ω)\E(I) are also constructed over Φ
and Γ. To do this, for every condition we identify the
distributions that could be used to make predictions for
unobserved targets in that condition. These sets Γ^(c) are given
by the common refinement

\[ \Gamma^{(c)} = \bigcup_{(t,c) \in E(I)} \Gamma(t,c) \]

Since the correlations in Γ^(c) may make different phenotype
predictions for the same target, the choice operator will pick
one of them. Taken together, given a model M = (Φ, Γ, ε),
predictions (when they exist) are defined as

\[ P[F(O) \mid (t,c), M] = \begin{cases} \phi_i[t] & \text{if } i = \Gamma(t,c) \text{ and } (t,c) \in E(I) \\ \phi_i[t] & \text{if } i = \varepsilon(\Gamma^{(c)}) \text{ and } (t,c) \notin E(I) \end{cases} \]

These predictions may be augmented by various data
imputation methods (described below). In their absence, we
choose ε to be the function such that we predict the most
common correlation for each target to make a phenotype
prediction.

We considered two methods, a "Greedy Merge" and a 
Quantified Boolean Formula Satisfaction (QBF/SAT) [16] based 
estimation procedure termed "B-Clustering." 

Greedy Merge Structure Learning 

Greedy Merge produces Φ and Γ from data and a clustering
of observations by iteratively combining condition-specific
distributions under the assumption that some of the conditions
affect all targets in the same ways. These are determined by
iteratively computing model estimates M_z = (Φ_z, Γ_z, ε) which are
monotone decreasing in the size of Φ. We considered two
variants: the first performs only the first two steps
below, and the second (the Greedy Merge used
throughout our work) performs all three steps below.

Initialization. Let M_0 = (Φ_0, Γ_0, ε). Associate a φ_c with every
c ∈ C such that for all observed (t,c,o) ∈ I, φ_c[t] = F(o). Set Φ_0
to be the set of all φ_c, and Γ_0(t,c) = c. This produces an initial
model estimate where observed experiments are assumed
conditionally independent if they differ in condition.

Merge Overlapping. To produce M_{z+1} from M_z = (Φ_z, Γ_z, ε),
arbitrarily choose two different φ_i, φ_j ∈ Φ_z such that their
supports overlap and in the overlap predictions do not differ
(φ_i[t] = φ_j[t] for t in the common support). Set fresh φ_z = φ_i ∪ φ_j.
Replace φ_i, φ_j with φ_z to make a new Φ_{z+1}. Likewise, update
references to i and j in Γ with z. This step is iteratively applied.
At termination there are no more overlapping φ_i, φ_j to merge
and so M_z distinguishes between two experiments e_1, e_2 if the
distributions they are assigned to in Γ differ in any target's
phenotype. M_z may produce identical predictions for some
target t across two conditions c_1, c_2 (P[F(O)|(t,c_1)] = P[F(O)|
(t,c_2)]) but treat them as conditionally independent events
(Γ(t,c_1) ∩ Γ(t,c_2) = ∅) if there is some other t' where P[F(O)|
(t',c_1)] ≠ P[F(O)|(t',c_2)].

Merge Nonconflicting. This step is similar to Merge 
Overlapping, but the requirement that two distributions have 
common support is removed and any two nonconflicting 
distributions can be merged. 
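
The three steps above can be sketched as follows. This is an illustrative reading of the prose, not the authors' implementation: distributions are represented as plain dictionaries from target to phenotype, the "arbitrary" choice of a pair to merge is simply the first compatible pair in iteration order, and the names greedy_merge, compatible and owner are hypothetical.

```python
from itertools import combinations

def init_distributions(observed):
    """Initialization: one partial map target -> phenotype per condition."""
    phi = {}
    for (t, c), phenotype in observed.items():
        phi.setdefault(c, {})[t] = phenotype
    return phi

def compatible(a, b, require_overlap):
    """True if two distributions agree on every target they share."""
    shared = a.keys() & b.keys()
    if require_overlap and not shared:
        return False
    return all(a[t] == b[t] for t in shared)

def greedy_merge(observed, merge_nonconflicting=True):
    """observed: dict mapping (target, condition) -> phenotype."""
    phi = init_distributions(observed)
    owner = {c: c for c in phi}  # condition -> key of its distribution
    # Pass 1 merges overlapping pairs; pass 2 (optional) drops the overlap rule.
    passes = [True, False] if merge_nonconflicting else [True]
    for require_overlap in passes:
        merged = True
        while merged:
            merged = False
            for i, j in combinations(list(phi), 2):
                if compatible(phi[i], phi[j], require_overlap):
                    phi[i].update(phi.pop(j))  # union of the two maps
                    owner = {c: (i if k == j else k) for c, k in owner.items()}
                    merged = True
                    break  # restart the scan over the reduced set
    return phi, owner

# Usage: c1 and c2 agree on t1, c3 conflicts with both, c4 shares no target.
obs = {("t1", "c1"): "A", ("t2", "c1"): "B",
       ("t1", "c2"): "A", ("t3", "c2"): "C",
       ("t1", "c3"): "B",
       ("t4", "c4"): "D"}
phi_all, owner_all = greedy_merge(obs)
phi_two, owner_two = greedy_merge(obs, merge_nonconflicting=False)
```

Because the pair to merge is chosen arbitrarily, different iteration orders can yield different final models; the sketch makes no attempt to search over them.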

B-Clustering 

An alternative procedure would be to define properties that 
are believed to describe "good" models of the data, and then 
use an efficient search procedure (a satisfiability solver) to find 
examples of those models. This is most helpful when it is 
unclear how to construct an algorithm that directly estimates 
models which will satisfy the desired properties. We considered 
the use of Quantified Boolean Formula (QBF/SAT) methods 
built using the MiniSat solver [17] to identify a model subject to 
constraints defining an optimum. In this framework, each 
observed target and phenotype pair is associated with an index 
of a distribution. This implicitly defines distributions (which map 
targets to phenotypes) as the collection of target and 
phenotype pairs with the same index. To do this, each unique
observed target and phenotype pair (t,F(o)) is associated with a
vector of literals v_{t,o} which encodes in two's complement the
index of a distribution in Φ (e.g. a binary encoding of a natural
number). Legal assignments of each of these literals to true or
false will define the distributions. The set of legal assignments 
is constrained by introducing logical formulas which encode 
different criteria. 

An example criterion is to constrain the choice of model such
that each (t,F(o)) is described by exactly one distribution φ_i; this
ensures that each distribution predicts at most one phenotype
per target, and that all occurrences of a particular target and
phenotype pair must have a common cause. This is encoded in
a per-target constraint SingleOwner(t) which asserts that for
the set Ξ[t] of all (t,F(o)) with the same target, their distribution
indices v_{t,o} must be different.

\[ \mathrm{SingleOwner}(t) = \bigwedge_{\{(t,o),(t,o')\} \in \binom{\Xi[t]}{2}} v_{t,o} \neq v_{t,o'} \]

Another criterion (Coobserved(t,o)) is that for each
distribution φ_i, each pair of distinct targets t,t' in the support is
coobserved at least once in some condition c. That is, we
disallow distributions which make predictions that are totally
unsupported by mutual observations. Let β(t,F(o)) be the set of
conditions that a pair (t,F(o)) was observed in.

\[ \mathrm{Coobserved}(t,o) = \Big( \bigvee \{ v_{t,o} = v_{t',o'} \text{ for all } (t',o') : |\beta(t,o) \cap \beta(t',o')| > 0 \text{ and } t \neq t' \} \Big) \vee \Big( \bigwedge \{ v_{t,o} \neq v_{t',o'} \text{ for all } (t',o') \neq (t,o) \} \Big) \]
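
To make the two criteria concrete, the sketch below checks them against a candidate assignment of distribution indices. It is a checker, not a solver: it does not build a Boolean encoding or call MiniSat, and it tests the prose statement of each criterion directly. The names single_owner_ok and coobserved_ok, and the dictionary representations, are assumptions made for illustration.

```python
from itertools import combinations

def single_owner_ok(index):
    """index: dict mapping (target, phenotype) -> distribution index.

    Pairs that share a target must be assigned to different distributions.
    """
    by_target = {}
    for (t, _phenotype), i in index.items():
        by_target.setdefault(t, []).append(i)
    return all(len(ix) == len(set(ix)) for ix in by_target.values())

def coobserved_ok(index, beta):
    """beta: dict mapping (target, phenotype) -> set of conditions observed in.

    Two pairs on distinct targets may share a distribution only if they
    were observed together in at least one condition.
    """
    for a, b in combinations(index, 2):
        same_distribution = index[a] == index[b]
        distinct_targets = a[0] != b[0]
        if same_distribution and distinct_targets and not (beta[a] & beta[b]):
            return False
    return True

# Usage: t1/A and t2/B share distribution 0 and were both seen in c1.
index = {("t1", "A"): 0, ("t2", "B"): 0, ("t1", "C"): 1}
beta = {("t1", "A"): {"c1"}, ("t2", "B"): {"c1", "c2"}, ("t1", "C"): {"c3"}}
valid = single_owner_ok(index) and coobserved_ok(index, beta)
```

A search procedure would enumerate or solve for assignments that pass both checks while using as few distinct indices as possible.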



A third criterion restricts the valuations of each condition (Γ^(c))
to be disjoint, so that predictions of unobserved targets for
each condition are always unique.

\[ \mathrm{Noncontradiction}((t,o),(t',o')) = (v_{t,o} \neq v_{t',o'}) \Rightarrow \Big( \bigwedge \{ v_{x} \neq v_{t',o'} \text{ if } x \in \Xi[t] \} \Big) \]



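Other conditions may be applied. The model estimate
chosen is found by identifying the least number of distributions
N such that the SAT solver finds a solution where all of the
above hold:

\[ \operatorname*{argmin}_{N} \; \exists N .\; \Big( \bigwedge_{t} \mathrm{SingleOwner}(t) \Big) \wedge \Big( \bigwedge_{|\beta(x) \cap \beta(y)| > 0} \mathrm{Noncontradiction}(x,y) \Big) \wedge \Big( \bigwedge_{(t,o) \in \Xi} \mathrm{Coobserved}(t,o) \Big) \wedge \Big( \bigwedge_{(t,o) \in \Xi} v_{t,o} < N \Big) \]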



Imputation as Model Augmentation 

Ordinarily data or model imputation methods attempt to 
correct situations where most data are available and only a 
very small set are missing at random. In these situations, it 
may be reasonable to impute missing data by marginal 
estimates. Our learning problem is diametric: most of the data 
are missing and not at random. We therefore chose two 
alternate imputation rules to augment the model. For each we
modify ε to either be the unique imputed phenotype (if it exists)
for some (t,c) or the imputation arising from the most common
correlation for that t. However, we keep all possible imputations
for each (t,c) in a relation ℐ which maps from T×C to subsets of
the phenotypes F(O).

Target Equivalence Estimation 

A simple imputation procedure estimates equivalence 
classes of targets as measured by common or similar 
observations. If two targets agree in their observations 
everywhere that they are coobserved then we may reduce the 
model by associating the predictions of one with the other, 
possibly leading to a larger set of concrete predictions for both. 
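
A minimal sketch of this idea follows. It assumes that two targets must be coobserved in at least one condition before they can be called equivalent (the text leaves the never-coobserved case open), and it collects every borrowed phenotype into a set so that several candidate imputations can coexist. The function names are hypothetical.

```python
def targets_agree(observed, a, b):
    """True if targets a and b match in every condition where both were observed.

    observed: dict mapping (target, condition) -> phenotype.
    Assumption: at least one shared condition is required.
    """
    conds_a = {c for (t, c) in observed if t == a}
    conds_b = {c for (t, c) in observed if t == b}
    shared = conds_a & conds_b
    return bool(shared) and all(observed[(a, c)] == observed[(b, c)] for c in shared)

def impute_from_equivalent_targets(observed, targets, conditions):
    """Borrow phenotypes for unobserved (target, condition) pairs from agreeing targets."""
    imputed = {}
    for a in targets:
        for b in targets:
            if a == b or not targets_agree(observed, a, b):
                continue
            for c in conditions:
                if (a, c) not in observed and (b, c) in observed:
                    imputed.setdefault((a, c), set()).add(observed[(b, c)])
    return imputed

# Usage: t1 and t2 agree in c1, so t1 borrows t2's phenotype in c2.
obs = {("t1", "c1"): "A",
       ("t2", "c1"): "A", ("t2", "c2"): "B",
       ("t3", "c1"): "C", ("t3", "c2"): "B"}
imputed = impute_from_equivalent_targets(obs, ["t1", "t2", "t3"], ["c1", "c2"])
```

Pairwise agreement is not transitive, so this sketch deliberately stops short of forming equivalence classes; doing so would require an additional rule for resolving chains of agreement.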

Three-Point Imputation 

Deductive reasoning produces other structural assumptions.
We can interpret each distribution φ ∈ Φ as an assertion that for
any two distinct targets t,t' in its support, whenever we observe
in a condition c that one target t had phenotype φ[t] we may
predict that an unobserved experiment (t',c) has phenotype
φ[t']. If we iterate these predictions by assuming the largest set
possible of them, we can potentially make many more
predictions than are immediately justified by the model.
Formally, for each distribution we form the relation

\[ R[\phi](t,c) \Leftarrow \exists t'.\ \Gamma(t',c) \text{ and } \exists c'.\ \Gamma(t',c') \text{ and } \Gamma(t,c') \]

An experiment (t,c) is in R[φ] if there was a way to obtain
pairwise target predictions as described above from some
other condition c'. We write the transitive closure of R[φ] as cl
R[φ]; this relation captures the logical extension of φ to as
many (t,c) ∈ E(Ω) as possible by iterating until no new
experiments are added. These are computed for each
distribution φ separately. We interpret the case where (t,c) ∈ cl
R[φ] as weak predictions: "the phenotype of experiment (t,c)
might be φ[t]." Since an unobserved experiment (t,c) can be in
the closure of R for different distributions, it is sometimes the
case that there are multiple and distinct weak predictions for
that experiment. That is, if (t,c) ∈ cl R[φ_1] and (t,c) ∈ cl R[φ_2] it
can be the case that φ_1[t] ≠ φ_2[t]. The set of unobserved
experiments that have multiple weak predictions are where the
model may be considered concretely uncertain as opposed to
simply latent.
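
One way to read the three-point rule is as a rectangle-completion step: if (t',c), (t',c') and (t,c') are all explained by the same distribution, then (t,c) is added. The sketch below iterates that step to a fixed point and collects the resulting weak predictions. It is an interpretation under that assumption; the names three_point_closure, weak_predictions and valued are illustrative.

```python
def three_point_closure(points):
    """Close a set of (target, condition) pairs under the three-point rule."""
    closed = set(points)
    changed = True
    while changed:
        changed = False
        targets = {t for t, _ in closed}
        conds = {c for _, c in closed}
        for t in targets:
            for c in conds:
                if (t, c) in closed:
                    continue
                # Look for t2, c2 with (t2, c), (t2, c2) and (t, c2) all present.
                if any((t2, c) in closed and (t2, c2) in closed and (t, c2) in closed
                       for t2 in targets for c2 in conds):
                    closed.add((t, c))
                    changed = True
    return closed

def weak_predictions(phi, valued, observed):
    """Collect weak phenotype predictions for unobserved experiments.

    phi:      dict index -> {target: phenotype}
    valued:   dict index -> set of observed (target, condition) pairs assigned to it
    observed: set of all observed (target, condition) pairs
    """
    weak = {}
    for i, points in valued.items():
        for (t, c) in three_point_closure(points) - set(observed):
            weak.setdefault((t, c), set()).add(phi[i][t])
    return weak

# Usage: two distributions both reach the unobserved experiment ("b", 2).
phi = {0: {"a": "X", "b": "Y"}, 1: {"b": "Z", "c": "W"}}
valued = {0: {("a", 1), ("b", 1), ("a", 2)},
          1: {("b", 3), ("c", 3), ("c", 2)}}
observed = valued[0] | valued[1]
weak = weak_predictions(phi, valued, observed)
```

Experiments whose entry in the result holds more than one phenotype correspond to the "concretely uncertain" cases described above.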

Active Learner 

A batch learner sequentially proposes experiments for
observation given observed data. At batch step n, given data
I_n, the following are provided: model M = M_n[I_n] = (Φ, Γ, ε), the
collection of all possible imputations ℐ and the model reductions
R ⊆ 2^T used to form ℐ. The goal is to balance choosing
experiments amongst all those with imputations in ℐ, and all
possible refutations of identified correlations, taking into
account any symmetry relationships induced by R and their
refutations. Each unobserved experiment is given a rank
reflecting the number of distinct imputed observations and
through R, ℐ and Φ forms a set system. The next batch B_{n+1} is
computed as a weighted S-hitting set so as to minimize the
number of experiments expected to be imputable from each
other and to refute the greatest number of assumed conditional
independences.

Ranking Experiments and Symmetry Breaking 

We partition E(U_n) into disjoint subsets U^I, U^NI, where U^I =
E(U_n) ∩ E(ℐ) and U^NI is the remainder (slightly abusing notation
for E). We form a lookup R which returns all the targets which
are in the same model reduction equivalence class; if one was
not estimated, then R is just the identity map. Let C_u be those
c ∈ C with no observations in I_n; this set is usually empty after
learner initialization. A weak association on C × 2^C is introduced
in the following manner: for each c, let Q(c) be the relation that
identifies those c' ≠ c whose model predictions are equal for
some t ∈ T. Q(c) need not be symmetric and is always
irreflexive. Q is used to break symmetry through R in batch
selection by the relation W, which identifies those unobserved
(t,c) with any (t',c') such that c is weakly associated to c' (cRc')
and the model predictions differ (P[F(O)|(t,c)] ≠ P[F(O)|(t',c')]).
In words, W marks those experiments which have shown any
variation amongst similar conditions.

Given the above, a rank z(t,c) is computed over E(U_n). For
each (t,c), define the pre-rank z' to be the number of
imputations for (t,c) that have different phenotype
predictions: z'(t,c) = |{φ[t] for (t,c,φ) ∈ ℐ}|. Rank is defined as:

\[ z(t,c) = \begin{cases} W(t,c) + 1 & \text{if } z'(t,c) = 1 \\ W(t,c) + z'(t,c) + 3 & \text{otherwise} \end{cases} \]

Notice that this ranks all elements in U^NI over experiments
with a single concrete imputation. Informally this chooses
experiments that have many possible imputations, and then
those with no imputations, and only then considers choosing
experiments that have single imputations.

Batch Selection

From these ranks, a weighted S-hitting set is computed as B_{n+1}
so as to minimize the number of experiments expected to be
imputable from each other through R and Γ^(c). This is done
greedily, starting from the set of greatest rank, choosing an
unobserved experiment uniformly at random, and then
(temporarily) eliminating from consideration all those
experiments reachable through R and then selecting a next
experiment from the greatest nonempty rank set by repeating.
If S many elements have not been selected, then the
temporarily removed experiments are placed back into
consideration and the selection process is again applied; this
case generally only occurs when the apparent uniqueness of
the data is very low.

Learner Initialization

The learning process initializes from an empty I_0 to request
⌈(|T| + max(|T|,|C|)) / S⌉ many batches of experiments. These
S × ⌈(|T| + max(|T|,|C|)) / S⌉
many experiments will cover two sample sets. The first is all
targets under the unperturbed condition. The remaining
initializing experiments consist of a scoreboard of max(|T|,|C|)
points chosen such that each target and each condition is
sampled at least once, with the possibility of padding points
chosen at random to fill a complete batch B_1. This starting
choice for I_1 allows Target Equivalence Estimation to produce a
maximal (but not necessarily accurate) upper bound
equivalence reduction and observes every target at least twice
which provides a reasonable initial minimum bound estimate of
the number and partial identity of correlations.
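
The ranking rule and the greedy batch selection described above can be sketched as follows. This is a simplified illustration under stated assumptions: W is treated as a 0/1 indicator, the ranks and the "reachable through R" relation are supplied as precomputed dictionaries, and the names rank, select_batch and reachable are hypothetical rather than taken from the authors' software.

```python
import random

def rank(n_imputed, w):
    """Rank from the pre-rank (number of distinct imputed phenotypes) and W."""
    return w + 1 if n_imputed == 1 else w + n_imputed + 3

def select_batch(unobserved, ranks, reachable, size, rng):
    """Greedily pick up to `size` experiments, highest rank first.

    unobserved: iterable of candidate experiments
    ranks:      dict experiment -> rank
    reachable:  dict experiment -> set of experiments imputable from it
    rng:        a random.Random instance
    """
    batch = []
    remaining = set(unobserved)
    while len(batch) < size and remaining:
        pool = set(remaining)  # temporarily removed experiments return here
        while len(batch) < size and pool:
            top = max(ranks[e] for e in pool)
            pick = rng.choice(sorted(e for e in pool if ranks[e] == top))
            batch.append(pick)
            remaining.discard(pick)
            pool.discard(pick)
            pool -= reachable.get(pick, set())  # skip what this pick can impute
    return batch

# Usage: a and b can be imputed from each other, so only one is picked at first.
a, b, c, d = ("t1", "c1"), ("t2", "c1"), ("t1", "c2"), ("t2", "c2")
ranks = {a: rank(2, 0), b: rank(2, 0), c: rank(0, 0), d: rank(1, 0)}
reachable = {a: {b}, b: {a}}
batch3 = select_batch([a, b, c, d], ranks, reachable, size=3, rng=random.Random(0))
batch4 = select_batch([a, b, c, d], ranks, reachable, size=4, rng=random.Random(0))
```

With this ordering, an experiment with several competing imputations outranks one with none, which in turn outranks one with a single imputation, matching the informal description of the rank.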

Parameterization of Experiment Problem Space 

A description of experimental spaces with an equal number
N of targets T and conditions C can be parameterized in three
terms θ = (m, λ_r, λ_u) as follows. For convenience, fix an ordering
of T and C each over [N] with condition c=1 as the unperturbed
condition. Influenced conditions c ∈ 2..N are perturbations from
the unperturbed condition. Let m be the size of F(O). When the
observation for a particular t differs in condition c≠1 from
condition c=1 we say that the experiment was responsive; let λ_r
be the expected fraction of targets that are responsive.
Different t may have identical response across C and likewise
different c may similarly perturb T; let λ_u be the expected
fraction of each of T,C that are unique up to equivalence
through phenotypes. λ_r and λ_u are therefore rate parameters for
a truncated Poisson distribution.

A choice of θ generates data Ω = D[θ] by the following
process. Let n_T, n_C be the number of underlying (to be
replicated) targets and conditions respectively, n_T = ⌊(N−1)λ_u
+ 1⌋ and similarly for n_C. For each unperturbed experiment (t,1)
sample uniformly with replacement from [m]. Sample n_C − 1
times from the truncated Poisson distribution to determine the
number of responses per responsive condition. For each






condition c ∈ 2..n_C choose d_c many indices in [n_T]; observations
for these indices are set distinct from the unperturbed
condition. The data are completed by sampling with
replacement from [n_T] to fill out the N − n_T many replicated T,
and similarly for C.

Predicted Accuracy Score Regression and Stopping 
Rule Construction 

To characterize a model learned at a particular batch, we 
measured several features on both that model and on 
differences between that model and the model learned at the 
previous batch. All of these features are based on data 
available to the model; in particular, the parameterization of 
data used was not included. These features fell into several 
broad categories. 

The first set of features measured simple counts: (1) the 
current batch number, (2) the number of distributions in the 
model, (3) the number of unique phenotypes observed, (4) the 
number of experiments whose (predicted) phenotype is in 
agreement between the previous model and the current model 
and (5) the number of experimental conditions that differ within 
a target. 

The next set of features measures aspects of the model as a
Markov hypergraph system: (6) the minimum fraction of each
current distribution that was observed in the previous batch in a
particular condition, (7) the maximum fraction as above (6), (8)
the maximum of the fraction of current imputations or
distributions that the previous batch covered (e.g. how good an
approximation the last model was to the current model), (9)
the difference of the average number of each phenotype
observed between the previous and current models and (10)
the size of the maximal matching of distributions between the
previous and current models.

These features were combined with their pairwise products,
z-scored, and used to form the design matrix for regression. The
dependent variable was the measured accuracy, adjusted
by subtracting the percentage of the population observed
per-batch; this essentially removes the fraction of
accuracy one would expect at random. The design matrix was
regressed in logistic lasso [18] against the adjusted measured
accuracy; the choice of regularization constant was determined
by minimizing 10-fold cross-validation error (folds formed over the
whole of the data). Loadings were computed by ordinary least
squares fit using the nonzero features identified by lasso
regression, and used to produce predicted accuracy scores
from the design matrix. The resulting scores were then
readjusted by adding back in the percentage of population
observed per-batch and normalized so that the maximum was
1.0 instead of ~1.1.
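
The construction of the design matrix and the adjusted dependent variable can be sketched with the standard library alone. This covers only the preprocessing: the lasso fit, cross validation and least squares refit are omitted. The sketch assumes that "pairwise products" means products of distinct feature pairs and that a constant column is mapped to zero when z-scored; the function names are illustrative.

```python
from itertools import combinations
from statistics import mean, pstdev

def design_matrix(rows):
    """Append pairwise feature products to each row, then z-score each column.

    rows: list of equal-length lists of raw feature values (one row per batch).
    """
    expanded = [list(r) + [x * y for x, y in combinations(r, 2)] for r in rows]
    scaled_columns = []
    for column in zip(*expanded):
        mu, sd = mean(column), pstdev(column)
        # A constant column carries no information; map it to zeros.
        scaled_columns.append([(x - mu) / sd if sd > 0 else 0.0 for x in column])
    return [list(r) for r in zip(*scaled_columns)]

def adjust_accuracy(accuracy, fraction_observed):
    """Subtract the fraction of experiments observed so far from each accuracy."""
    return [a - f for a, f in zip(accuracy, fraction_observed)]

# Usage: three batches described by two raw features each.
X = design_matrix([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = adjust_accuracy([0.40, 0.70, 0.95], [0.10, 0.20, 0.30])
```

Adding the observed fraction back to the fitted scores reverses the adjustment, as described in the final step above.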

Gene Expression Analysis 

Normalized gene expression data were taken from the
Connectivity Map [19,20] dataset (available at
http://lincscloud.org as of time of writing). The dataset consists of
gene expression profiles in 48 cell lines under treatment by 280
drugs. We identified a completely observed submatrix of 50
highly drug-responsive genes (targets), 280 drugs (conditions)
and formed phenotypes of the measured gene expressions
across the 48 cell lines by k-means clustering. To identify the
50 genes, expression levels were z-scored per-gene and
ranked by variance explained by 280 treatments (variance of
gene expression levels conditioned on drug). The 50 genes
most varying according to treatment were chosen so the
resulting dataset was not trivial (i.e. there would likely be more
than one phenotype) and to limit computational requirements
for simulation. A (280×50, 48)-matrix of observations across
cell lines was formed with averages of technical replicates and
clustered by k-means with varying k; for each k the model that
minimized reconstruction error from 200 seeds was used. For
each of these, a (280, 50)-matrix was formed from the
phenotypes for the simulations to query.

Availability 

Scripts for setting up the simulations and generating figures
from the results are available from
http://murphylab.web.cmu.edu/software. Active learning software will
be made available for non-commercial use upon request.

Results 

Learning Problem 

As described in the Methods, we consider a general problem 
consisting of learning a model for the effects of different 
conditions upon different targets (the combination of which 
define an experiment) (Figure 1a). The result of each 
experiment is expressed as a categorical phenotype. Given 
some initial data, either in the form of phenotypes or other 
measurements from which we can obtain phenotypes (Figure 
1b), we learn correlations between the behaviors of targets and 
conditions that allow us to make predictions for unobserved 
experiments (Figure 1c). We then construct a batch of 
experiments to observe next in order to improve the model 
(Figure 1d). 

For this task, we considered different possible learning 
processes, each comprised of (i) a probabilistic model, (ii) a 
structure learning method for the model, (iii) a choice of data 
imputation methods and (iv) a choice of active or random
learning strategy along with (v) a stopping rule which gives an 
estimate for when a 'good enough' model has been learned 
(Methods). 

Model Selection 

In order to test the ability of the models described above to 
support active learning, we performed computational 
experiments for several model designs. For these simulations, 
we generated datasets consisting of m phenotypes for a set of 
targets and conditions. Each target was assigned a base 
(unperturbed) phenotype; the probability that a target would 
change phenotype for other conditions was given by a 
parameter λ_r ("responsiveness"). The extent to which targets
showed the same responses across all conditions, and the
extent to which conditions had the same effect on all targets,
was controlled by a parameter λ_u ("uniqueness"). For
illustration, λ_u=1 would correspond to all targets and conditions
showing a unique combination of phenotypes, and λ_u=0.1
would correspond to an average of 10% of targets and
conditions showing the same combination.

We performed computational experiments for several model 
designs, each consisting of a choice between two structure 
learning methods (Greedy Merge and B-Clustering) with 
predictions augmented with one of four combinations of 
imputations. The simulations were evaluated for 100 targets
and 100 conditions with parameterization θ=(m=8, λ_r=80%,
λ_u=40%) with a fixed batch size of 100 (Methods). At each
batch the best accuracy for either the random or active learning
strategy was chosen as an indication of how well that design
can perform. These are displayed in Figure 2A. Most model
designs showed linear increase in accuracy with batches as
would be expected for a model-free random sampler. Only five
model designs showed learning that was superlinear. The
batch-wise differences between active and random learning
accuracies for these five designs are shown in Figure 2B.
Different designs show peaks in improvement over random 
after different numbers of batches have been observed. 

Model Performance 

We then evaluated the performance of active and random
learning methods for each of these model designs across a
broad range of λ_r and λ_u for 32 phenotypes. We measured the
difference in the number of batches required to achieve 100%
predictive accuracy between active and random learning
methods. As Figure 3A indicates, our active learning strategy
with Greedy Merge structure learning achieved 100%
predictive accuracy more rapidly than random learning over the
majority of the sampled range of λ_u and λ_r, with qualitatively
similar behavior for 90% accuracy (Figure 3B). The
improvement is much less for B-Clustering (Figure 3C,D).
However, as discussed below, there are cases where each
method dramatically outperforms random sampling.

Figure 4 shows example learning curves for specific 
combinations of A u and A r . The most striking conclusion 
(echoing Figure 2) is that the models learn much more rapidly 
than random sampling. Figure 4A shows a case with a 
high A r and low A u . The initial models are poor in these cases as 
predictions from the unperturbed condition do not generalize 
well, but rapidly improve as correlations are learned, 
generalized and used to identify likely responsive experiments. 
The combination of the Greedy Merge model with active 
learning gives a perfect accuracy after only about 30% of the 
data have been sampled. By contrast, the "needle in the 
haystack" case in Figure 4B (small A r and large A u ) is initially 
predicted well by either learner with either structure learning 
method but further progress is slow and occasionally leads to 
poor models. Nonetheless, high accuracy is achieved before 
full sampling. Overall, while the efficacy of different active 
learning methods varies somewhat for different A u and A r 
values, the results of Figures 3 and 4 show a significant benefit 
in sampling with our active learners for the same number of 
batches as compared to a random learner in almost all cases 
(an important conclusion since A u and A r will not usually be 
known). 



Probability of Approximate Correctness 

One potential problem with using active learning to perform 
only selected experiments is knowing when to stop. We 
therefore asked if it is possible for an experimenter to estimate 
the predictive accuracy of an actively learned model without 
completing all experiments. One way to do this would be to 
form a prediction of the accuracy of a model and a confidence 
that measures how likely the true accuracy (which the 
experimenter does not know) meets or exceeds the predicted 
accuracy. 

We empirically evaluated this possibility for the Greedy 
Merge model by simulating a broad range of data with 
dimensions as before. These data were formed by randomly 
and uniformly sampling parameters in the cube (m=18..100, 
A r =5..95%, A u =5..95%). For each of these, we measured 
features at every batch that described differences between the 
model learned at the previous and current batches. Features 
were limited to knowledge available to the learner at a 
particular batch and not reliant on unseen data, or on the 
parameters the data were drawn from. These features were 
then collected and regressed against the true model accuracy 
to produce a predicted accuracy score (Methods). 

The predicted accuracy score is in general a conservative 
estimate of accuracy, with the highest correspondences at 
higher true accuracies (Figure 5A). On the whole (Figure 5B) 
extremes in the true accuracy are identified with high 
confidence. A practitioner may then be confident that a model 
with a predicted accuracy score above ~80% is almost certainly 
at least that good. Furthermore, the per-batch and predicted 
accuracy score confidences (Figure 5C) are conservative 
estimates everywhere. As an example, for a model acquired 
early in the learning process (batch 10), if we obtain a predicted 
accuracy score of 70%, we can be ~90% certain that the true 
model accuracy is in excess of 70%. Likewise, hard-to-learn 
cases are identified as such with low predicted accuracy scores 
or low confidence. With these a practitioner may choose a 
minimum target accuracy, or limit the total number of 
experiments performed, and still assert a quantitative bound on 
the accuracy of the model. 
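
One way such a confidence could be tabulated from simulations is sketched below. The sketch takes the predicted accuracy scores as given (the regression that produces them is not reproduced) and estimates, per score bin, how often the true accuracy met or exceeded the predicted score. The function names, the 10-point bin width, the stopping thresholds and the example values are all assumptions made for illustration.

```python
from collections import defaultdict


def confidence_by_score(predicted, true, bin_width=10):
    """Empirical P(true accuracy >= predicted score) per score bin.

    `predicted` and `true` are equal-length sequences of accuracies in
    percent (0..100) from simulated runs where the truth is known. Bins
    are keyed by their lower edge, e.g. 70 covers scores in [70, 80).
    """
    hits = defaultdict(int)
    counts = defaultdict(int)
    for p, t in zip(predicted, true):
        lower = min(int(p) // bin_width * bin_width, 100 - bin_width)
        counts[lower] += 1
        if t >= p:
            hits[lower] += 1
    return {b: hits[b] / counts[b] for b in sorted(counts)}


def should_stop(score, table, target=70, min_confidence=0.9, bin_width=10):
    """Stop when the score meets the target and its bin is trusted enough."""
    lower = min(int(score) // bin_width * bin_width, 100 - bin_width)
    return score >= target and table.get(lower, 0.0) >= min_confidence


# Made-up (predicted score, true accuracy) pairs, for illustration only.
predicted = [72, 75, 78, 71, 85, 88, 90, 55]
true_acc = [80, 74, 90, 79, 91, 86, 95, 40]
table = confidence_by_score(predicted, true_acc)
```

A practitioner would build the table once from simulated data and then consult it during a real campaign, where only the predicted score is available.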

Application: Learning the Effects of Drugs on Gene 
Expression Levels across Cell Lines 

In order to demonstrate the utility of this approach using 
experimental data rather than simulated data, we applied the 
Greedy Merge model to a dataset of gene expression profiles 
in 48 cell lines under treatment by 280 drugs. An unresolved 
issue is how to decompose these profiles into distinct 
phenotypes. To avoid justifying a specific choice, we 
considered a wide range of possible values (2-73) for the 
number m of distinct expression phenotypes and formed them 
by k-means clustering. For a given number of phenotypes, we 
can calculate the average A r and A u . Figure 6 shows the 
improvement of Greedy Merge with Active learning over 
Random learning as a function of these average A r and A u 
values. Consistent with Figure 3, a 21%-40% reduction in the 
percent of experiment space required to achieve 95% accuracy 
was observed. 
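
The phenotype-forming step can be illustrated with a generic k-means routine. The sketch below is plain Lloyd's algorithm in standard-library Python and is not the implementation used for the results above; the toy two-dimensional profiles, the function name and the seed are hypothetical.

```python
import random


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means on a list of equal-length numeric tuples.

    Returns one cluster label (0..k-1) per point. Initial centers are k
    points drawn without replacement using a fixed seed, so runs repeat.
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2
                                            for x, y in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy expression profiles forming two well-separated groups.
profiles = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1),
            (5.0, 5.1), (5.1, 4.9), (4.9, 5.0)]
labels = kmeans(profiles, k=2)
```

Repeating the call for each candidate number of phenotypes m would give one labeling of the experiments per m, from which quantities such as the average A r and A u could then be computed.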

Figure 2. Learning performance dependence on model design: structure learning and imputation rule choice. (A) Each 
model design was evaluated with both active and random learners on two simulated 100 target x 100 condition datasets, each 
having eight phenotypes, 80% responsiveness and 40% uniqueness. For each model design the best average accuracy for either 
the active or random learner is plotted at each batch. For six cases displaying superlinear performance, structure learning methods 
are indicated in color, with different design variations plotted as separate lines and with filled circles to indicate batches where the 
active learner had higher accuracy: Greedy Merge (blue), a 'strict' variation of Greedy Merge (red), and B-Clustering (green, one 
design). These each had both Target Equivalence Class and Three-Point Imputation rules. (B) The difference in random and active 
learner accuracies for the superlinear model designs with structure learning method plotted by color as above; filled circles at tails 
indicate that the active learner had reached 100% accuracy. 

doi: 10.1371/journal.pone.0083996.g002 

Figure 3. Active learning performance for different model designs. Performance was measured as the difference in the 
number of batches to achieve (A,B) 100% or (C,D) 90% accuracy between active and random learning. (A,C) Greedy Merge, (B,D) 
B-Clustering. Warmer colors indicate greater experiment savings with an active learner. 

doi: 10.1371/journal.pone.0083996.g003 



Discussion 

We have described a learning approach suitable for the 
study of large, complex systems where the constituents have 
unknown or incomparable relationships. We have developed 
and presented empirical characterization of a class of models 
that capture the structure which target-condition dependence 
exhibits, structure inference algorithms for the class of models 
that are suitable for sparse data and methods for imputing 
missing values based on the structure of the learned models. 
Importantly, since different targets may be part of very different 
biological mechanisms, and yet have correlated responses in 
various conditions, the models capture patterns in the 
phenotypes without assuming a causal structure among the 
targets. From these we have described and evaluated a batch 
active learner capable of sequentially proposing informative 
experiments. Our results show that it is possible to learn highly 
accurate models without exhaustive experimentation. 

Critically, we have also shown that it is possible to produce 
an estimate of probable approximate correctness of the 
learning process without access to complete data. To the best 
of our knowledge, this is the first nontrivial active learner that 
(empirically) enjoys useful learning guarantees analogous to 
classical random sampling methods. This permits a decision 
about when an active learning process can safely be stopped. 



An important application of this work will be to efficiently 
identify and model the dependencies of cellular targets upon 
potential drugs or drug cocktails; we are unaware of previous 
methods approaching the efficiencies reported here. Towards 
this, we were able to show that the expression levels of genes 
across diverse cell types under different drugs can form 
consistent patterns whose emergent structure can be 
accurately and rapidly learned. Interestingly, our results 
indicate that while it is possible to learn efficiently even for the 
binarized case (two phenotypes), there may be even 
greater efficiencies when considering finer granularity of drug 
responses. 

The learning problem here is similar to other well-studied 
problems. DNF formula learning [21] and multiarm bandit 
optimization [8] commonly consider categorical constituents 
and restrictions to equality comparisons. Furthermore, as with 
black-box optimization [22], we make very weak assumptions 
on the structure of the data and rely on nonparametric 
estimates. The tradeoff for weak data assumptions is that 
nonparametric methods are generally data biased predictors 
[23]. Close alternatives to our approach generally make 
parametric assumptions on the structure and topology of data. 
In particular, matrix completion [24,25] and similar regression- 
based methods are the natural extension of our models but 
require algebraic invariants on the marginal distributions of 
data [26,27]. We were motivated to explore the approaches 

Figure 4. Example learning curves. Mean learning rates for active (solid) and random (dashed) learners across structure 
learning methods, Full Greedy Merge (blue) and B-Clustering (green). Data from experiments in Figure 3 for (A) (A r =90%, A u =25%); 
(B) (A r =10%, A u =70%). 

doi: 10.1371/journal.pone.0083996.g004 



presented here as we thought they would perform better in 
cases with sparse, not-missing-at-random data of the kind 
expected from an active learning process. 

Our formulation of the target-compound problem intentionally 
ignores any prior information about similarities among targets 
and among compounds (since such information is potentially 
inaccurate). However, in separate work we have demonstrated 
that including it with active learning can increase the learning 
rate (Kangas, Naik, Murphy, submitted). The availability of both 
types of methods will be important to future work in this area. 

Figure 5. Probability of approximate correctness over a broad range of data. (A) The empirical density of the correspondence 
between the predicted accuracy score and the true (latent) accuracy; lighter colors indicate greater frequency. (B) Confidence (in 
units of probability) per level set of predicted accuracy score. (C) Per-batch and (1% binned) predicted accuracy score confidences; 
color indicates confidence (in units of probability). 

doi: 10.1371/journal.pone.0083996.g005 

Figure 6. Learning the effects of drugs on gene expression levels across cell lines. Gene expression levels of the genes that 
varied most across drug treatments were used to form experimental observations across 48 cell lines. Each point represents a 
different number of phenotypes, varying from two (bottom left hand point) to 73 (upper right hand point). Warmer colors indicate 
greater experiment savings with an active learner. 

doi: 10.1371/journal.pone.0083996.g006 